MultiLoc2 predicts subcellular localizations for a given protein sequence. We would like to run the local version of this code over every human protein sequence, so that, to be safe, predictions are available for any protein we might need.
To do this, we first need FASTA sequences of all human protein sequences.
In a previous notebook all associated protein sequences for each Entrez ID were retrieved. We require canonical sequences for each Entrez ID. These can be found using the mapping files on the iRefIndex website. See this page for information on how iRefIndex handles canonicalisation.
In [1]:
cd ../../iRefIndex/
In [2]:
!head -n 3 mappings.txt
Columns 2 and 3 contain the external identifier and the Entrez Gene ID. We would like to match the human Gene IDs to RefSeq IDs. An easy way to do this is to download the RefSeq protein FASTA entries for all human genes and then map these back to the Entrez Gene IDs using this table.
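To double-check which fields hold which identifiers, a quick sketch like the following can print the first few rows split into columns. This assumes mappings.txt is tab-delimited, as the parsing code further down also does:

# peek at the first few rows of mappings.txt split into tab-delimited fields,
# to check which columns hold the external (RefSeq) identifier and the Entrez Gene ID
import csv

with open("mappings.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    for i, row in enumerate(reader):
        if i >= 3:
            break
        for j, field in enumerate(row):
            print "column {0}: {1}".format(j, field)
        print "---"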
In [4]:
!wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz
In [5]:
!gunzip human.protein.faa.gz
We can parse this FASTA file using Biopython:
In [4]:
from Bio import SeqIO
In [5]:
proteinrecords = list(SeqIO.parse("human.protein.faa","fasta"))
Printing an example entry to check:
In [6]:
# the record IDs appear to be pipe-delimited NCBI headers (e.g. gi|<number>|ref|<RefSeq accession>|),
# so field 3 should hold the RefSeq accession
print proteinrecords[0].id.split("|")[3]
In [7]:
refseqtoentry = {}
for r in proteinrecords:
    try:
        refseqtoentry[r.id.split("|")[3]] += [r]
    except KeyError:
        refseqtoentry[r.id.split("|")[3]] = [r]
Checking that each ID is unique, as expected for RefSeq IDs:
In [8]:
ridlengths = []
for k in refseqtoentry.keys():
    if len(refseqtoentry[k]) != 1:
        print "Entry {0} has length {1}".format(k,len(refseqtoentry[k]))
So everything is OK. Next, we take these RefSeq IDs and map them onto Entrez IDs:
In [10]:
import csv
In [17]:
#prime the dictionary
refseqtoentrez = {}
for k in refseqtoentry.keys():
    refseqtoentrez[k] = []
In [18]:
f = open("mappings.txt")
c = csv.reader(f,delimiter="\t")
lc = 0
for l in c:
try:
refseqtoentrez[l[1]] += [l[3]]
except KeyError:
#then it's not a human refseq
pass
if lc%10000000 == 0:
print "Reached line {0}".format(lc)
lc += 1
f.close()
In [19]:
print "{0} Refseq IDs mapped to {1} Entrez IDs".format(len(refseqtoentrez.keys()),
len(list(flatten(refseqtoentrez.values()))))
Check that this mapping is 1 to 1:
In [21]:
ridlengths = []
for k in refseqtoentrez.keys():
    if len(refseqtoentrez[k]) > 1:
        print "Entry {0} has length {1}".format(k,len(refseqtoentrez[k]))
So all of the RefSeq IDs with a non-empty mapping map precisely 1 to 1 to an Entrez ID. Those Entrez IDs are the proteins we would like to run through the MultiLoc2 script. We can now iterate over these dictionaries and build a new dictionary relating each Entrez ID to its corresponding canonical sequence:
In [12]:
entreztorecord = {}
for k in refseqtoentrez.keys():
    if refseqtoentrez[k]:
        entreztorecord[refseqtoentrez[k][0]] = refseqtoentry[k][0]
In [12]:
import pickle
In [23]:
# entreztorecord values are already individual SeqRecord objects, so no flattening is needed
SeqIO.write(entreztorecord.values(),"human.canonical.refseq.fasta","fasta")
Out[23]:
In [16]:
f = open("human.canonical.entrez.pickle","wb")
pickle.dump(entreztorecord,f)
f.close()
The dictionary mapping RefSeq IDs 1 to 1 to Entrez IDs could also come in useful, so that should be stored as well:
In [25]:
f = open("human.canonical.refseqtoentrez.pickle","wb")
pickle.dump(refseqtoentrez,f)
f.close()
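For later notebooks, a minimal sketch of loading these pickles back (assuming the same working directory) would look something like:

import pickle

# reload the Entrez ID -> canonical SeqRecord dictionary
with open("human.canonical.entrez.pickle", "rb") as f:
    entreztorecord = pickle.load(f)

# reload the RefSeq ID -> Entrez ID mapping
with open("human.canonical.refseqtoentrez.pickle", "rb") as f:
    refseqtoentrez = pickle.load(f)

print "Loaded {0} canonical records".format(len(entreztorecord))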
To install MultiLoc2, I downloaded the tar file from here and installed the dependencies through the AUR.
Then I ran the configuration script:
In [36]:
cd ../multiloc2/MultiLoc2-26-10-2009/
In [37]:
!python2 configureML2.py
First, testing the script on just a few entries:
In [38]:
!python2 src/multiloc2_prediction.py
In [41]:
#writing a small FASTA file here:
testlist = entreztorecord.values()[0:10]
SeqIO.write(testlist,"human.testlist.fasta","fasta")
Out[41]:
In [43]:
!python2 src/multiloc2_prediction.py -fasta=human.testlist.fasta -origin=animal -result=human.testlist.multiloc2
Running in parallel may be faster:
In [45]:
#write 10 small files:
for x,r in zip(range(10),testlist):
    SeqIO.write([r],"human.{0}.fasta".format(x),"fasta")
In [68]:
#write a script to execute all of the files in parallel:
f = open("runmultiloc2.sh", "w")
f.write(r"#!/bin/bash")
f.write("\n")
for x in range(10):
    f.write("python2 src/multiloc2_prediction.py -fasta=human.{0}.fasta -origin=animal -result=human.{0}.multiloc2".format(x))
    f.write(" & \n")
f.write("wait\n")
f.write("echo complete")
f.close()
In [69]:
!cat runmultiloc2.sh
In [70]:
!chmod +x runmultiloc2.sh
In [ ]:
!./runmultiloc2.sh
Cleaning up:
In [73]:
!rm human.*
In [72]:
import time
In [77]:
testlist = entreztorecord.values()[0:1000]
SeqIO.write(testlist,"human.testlist.fasta","fasta")
Out[77]:
In [78]:
pre = time.time()
In [79]:
!python2 src/multiloc2_prediction.py -fasta=human.testlist.fasta -origin=animal -result=human.testlist.multiloc2
In [80]:
post = time.time()
As can be seen from the output above, the script hit an error, so the predictions were not actually produced.
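For reference, had the run completed, the elapsed time could be reported from the two timestamps above with a one-liner like this:

# report how long the MultiLoc2 run took, in seconds
print "Elapsed time: {0:.1f} seconds".format(post - pre)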
The ENTs paper uses results from the MultiLoc2 script for the same task we are working on: protein-protein interaction prediction. Instead of running MultiLoc2 ourselves, we can therefore just use the precomputed results in the file found in the ENTs standalone download here.
In [81]:
cd ../../ents/standalone/
In [82]:
ls
In [83]:
!head h_sapiens_subcellular_multiloc2.out
So what is this identifier they're using? An example from above is ENSP00000442112, which looks like a human Ensembl protein ID.
So we just need to map from our list of canonical human proteins to these Ensembl IDs.
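As a first step, a minimal sketch like the one below would collect the set of Ensembl protein IDs covered by the precomputed file. This assumes the identifier is the first whitespace-delimited field on each line, as it appears to be in the head output above:

# collect the Ensembl protein IDs present in the precomputed MultiLoc2 results;
# assumes the ID is the first whitespace-delimited field on each line
ensemblids = set()
with open("h_sapiens_subcellular_multiloc2.out") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0].startswith("ENSP"):
            ensemblids.add(fields[0])

print "{0} Ensembl protein IDs in the precomputed results".format(len(ensemblids))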
We would like to use the same features used in this paper, as they appear to be effective, so we have to parse the files in the ENTs folder with a method similar to that used in the paper. To do this, we will have to inspect the code.
Opening a new notebook to repurpose the ENTs code into our pipeline.